Multivariate Chemometrics with Regression and Classification Analyses in Heroin Profiling Based on the Chromatographic Data.

The purpose of this work is to promote and facilitate forensic profiling and chemical analysis of illicit drug samples in order to determine their origin, methods of production and transfer through the country. The article is based on the gas chromatography analysis of heroin samples seized from three different locations in Serbia. Chemometric approach with appropriate statistical tools (multiple-linear regression (MLR), hierarchical cluster analysis (HCA) and Wald-Wolfowitz run (WWR) test) were applied on chromatographic data of heroin samples in order to correlate and examine the geographic origin of seized heroin samples. The best MLR models were further validated by leave-one-out technique as well as by the calculation of basic statistical parameters for the established models. To confirm the predictive power of the models, external set of heroin samples was used. High agreement between experimental and predicted values of acetyl thebaol and diacetyl morphine peak ratio, obtained in the validation procedure, indicated the good quality of derived MLR models. WWR test showed which examined heroin samples come from the same population, and HCA was applied in order to overview the similarities among the studied heroine samples.


Introduction
Illicit drug profiling provides law and police authorities essential physicochemical information that may assist in identification and disruption of drug trafficking in one country. Results of chemical analysis may allow investigators to determine geographical origin of the illicit drug, synthetic path and chemical precursors of synthetic drugs. The physical evidence combined with chemical analysis can be also used to establish links between different seizures of illicit drugs. This is the first attempt in Serbia to acquire chemical and profiling data on seized heroin samples and disseminate information to appropriate national and regional governmental agencies. In this paper, authors endeavored to determine the chemical fingerprints or signatures of seized heroin samples in three different locations in Serbia: border crossings: Batrovci and Horgoš, but also Novi Sad municipality, labelled as batch (1), (2) and (3). Serbia was chosen as the entering point to European Union, since the common trafficking routes from Middle East to EU are passing its territory mostly through the mentioned border crossings (1) with Croatia and (2) with Hungary. It is also known that heroin can be easily prepared at the many "homemade" laboratories, some of which have existed on the territory of Novi Sad municipality.
Heroin is a semi-synthetic derivative of morphine. Due to differences in the way of growing opium poppies and the different synthetic routes during the synthesis of heroin, the presence and concentration of opium alkaloids vary in the final product. Also, during the acetylation of morphine, the other alkaloids of opium may react. Thus, their presence and concentrations vary in the final product. The presence of diluents (mannitol, glucose) and adulterants (caffeine, acetaminophen) provide additional information on the geographical origin of heroin and the way of production. Determining the concentration of opium alkaloids, adulterants and diluents makes chemical «profile» of heroin. Due to the presence of a large number of compounds, the chemical profiles of heroin can be very complex, which further complicates the analysis and profiling of the illicit drug (1-3).
One of the most recent ways to use large amounts of data collected during different chemical analysis of illicit drugs is the application of the chemometric tools suitable for data mining and forensic profiling. In this way, investigators can get insight of the drug production, drug trafficking and geographical origin of the sample. Chemometric approach studies are undoubtedly of a great importance in modern chemistry and biochemistry, especially because of possibility to screen a large number of chemical data (i.e. different molecules or analytes) in a short time and with a low cost (4-7). Multivariate chemometrics is a very useful and powerful tool when the main issue includes dealing with multicomponent data sets (8,9). It allows the extraction of maximum information from complicated datasets. The conclusions in forensic science must be drawn from objective sources as much as possible. The forensic scientists must always follow rigid statistical protocols in the process of making decisions based on experimental data. Hence, our present paper explores the usefulness of multivariate chemometrics with regression and classification approaches in the discrimination of seized heroin samples. The aim of the study was to establish a simple analytical procedure followed by chemometric approach that can recover batch links among limited number of seized heroin samples.

Material and methods
All samples used in this work were seized during various actions by the Serbian police at the border crossings (1) and (2), together with those seized on the territory of Novi Sad municipality (3). Chemical analysis was performed in the laboratories of the National Forensic Technical Center in Novi Sad. All chemicals used (chloroform, pyridine and MSTFA) were pro analysi quality manufactured by Merck.

Analytical procedure.
Heroin samples were homogenized first in a mortar. Mass of 0.15-0.25 g of the sample was quantitatively transferred to vials, together with 200 μL of chloroform + pyridine (1:1) solution in order to dissolve the samples and 200 μL of silylating reagent (MSTFA). Prepared samples are heated for 1 h at 60 °C and then injected in the gas chromatograph with flame ionization detector GC-FID Agilent 6890N. Injected volume was 2 μL and split mode 50:1. Chromatographic separation was achieved on a capillary column DB-1 (length 30 m, internal diameter 0.25 mm, film thickness 0.25 μM). Carrier gas was nitrogen at a pressure of 66.6 kPa. The samples were heated for one minute at 150 °C, then up to 250 °C with heating rate of 10 ºC min -1 . This temperature was maintained for 10 min. The general purpose of MLR analysis is to quantitate the relationship between several independent or predictor variables and a dependent variable. MLR model is built with descriptive variables using the least squares methods to minimize the residuals (13). General MLR model is: where y is the quantitative property to predict (dependent variable), x n an independent (descriptive) variable, a the intercept, and b n the regression coefficient for x n .
HCA is a method for dividing a group of objects into classes so that similar objects are in the same class (cluster). The groups of entities are not known prior to the mathematical analysis and no assumptions are made about the distribution of the variables. Cluster analysis searches for objects which are close together in the variable space. The data in each cluster share some common trait, often proximity according to some defined distance measure (14).
Wald-Wolfowitz run test (WWR) can be applied to examine if two random samples come from populations with the same distribution. WWR test can detect differences in averages or spread or any other important aspect between the two populations (15). This test is efficient when each sample size is greater than or equal to 10 (15). This method includes testing the null hypothesis -H 0 : two samples come from populations having the same distribution. At the start it is necessary to define the critical value for "run" number (r cr ). We can calculate this value based on the following equations (15): r cr = μ -1.96 σ (at 5% level of significance) where: μ = 1 + ((2 n 1 n 2 ) / (n 1 + n 2 )) (3) σ = ((2 n 1 n 2 (2 n 1 n 2 -n 1 -n 2 )) / ((n 1 + n 2 ) 2 (n 1 + n 2 -1))) ½ (4) n 1 -size of sample 1 n 2 -size of sample 2 So-called "run" number (r) can be obtained from the list of n 1 + n 2 observations from two samples in order of magnitude. It represents the number of sections of consecutive values which belong to the same sample and it can be counted from the list of n 1 + n 2 observations. Observations from sample 1 should be denoted as Xs and other as Ys, and then the number of runs can be counted. Afterwards, the r and r cr numbers can be compared. The H 0 hypothesis has to be rejected if r ≤ r cr (15).

Results and discussion
In the first step of the present study, gas chromatography (GC) analyses were applied on thirty eight different samples of heroin from three locations in Serbia. The data of gas chromatography (GC) analyses are summarized in Table 1. as the peak area ratio of four secondary components, namely: acetyl thebaol (TEB), 6-monoacetyl morphine (MAM), papaverine (PAP), noscapine (NOS), and the main psychoactive component of diacetyl morphine (DAM). All these components were identified according to their retention times. In the second step, we focused our efforts on developing the MLR models that can determine the geographical origin of heroin samples. A set of twenty eight collected data (samples 1-28) was used for MLR modeling. TEB/DAM was used as a dependent variable in the regression analysis, and MAM/ DAM, PAP/DAM and NOS/DAM were used as independent variables.
MLR procedure was used to model the relationships between the data of GC analyses. The stepwise regression (SWR) method was used to derive the most significant models as a calibration models for prediction of TEB/ DAM peak ratio of seized heroin samples. The specifications for the best selected MLR models are shown in Table 2.
The statistical quality of the generated models was checked by statistical parameters: correlation coefficient (r), standard error of estimation or standard deviation (s), and F-test (Fisher›s value) for statistical significance (16-18). Correlation coefficient r (or coefficient of multiple determination) is a relative measure of the fit by the regression equation. Correspondingly, it represents the part of variation in the observed data that is explained by the regression. Standard It is well known that there are three important components in any chemometric-regression analysis: the development of the models, validation of the models and the utilization of developed models. Validation is a crucial aspect of any regression analysis (19). For testing the validity of the predictive power of selected models leave one out (LOO) technique was used. The developed models were validated by the calculation of the following statistical parameters: PRESS, SSY, S PRESS , r 2 CV , and r 2 adj (Table 3.). These parameters were calculated from the following equations: where, Y obs , Y calc and Y mean are observed, calculated and mean values; n is number of the samples and p is number of independent parameters.
PRESS is an acronym for prediction sum of squares. It is used to validate a regression model regarding to its predictability. To calculate PRESS, each observation is individually omitted. The remaining n-1 observations are used to calculate a regression and estimate the value of the omitted observation. This is done n times, once for each observation. The difference between the actual Y value, Y obs , and the predicted Y calc , is so-called the prediction error. The sum of the squared prediction errors is the PRESS value. The smaller PRESS is, the better predictability of the model is achieved. SSY are the sums of squares associated with the corresponding sources of variation. These values are in terms of the dependent variable, Y.
The above PRESS value can be used to compute an r 2 CV statistic, called r 2 cross validated parameter, which reflects the prediction ability of the model. This is a good way to validate the prediction of a regression model without selecting another sample or splitting the data. It is very possible to have a high r 2 and a very low r 2 CV . When this occurs, it implies that the fitted model is data dependent. This parameter ranges from below zero to above one. When outside the range of 0-1, it is truncated to stay within this range. Adjusted r-squared (r 2 adj ) is an adjusted version of r 2 . The adjustment seeks to remove the distortion due to a small sample size. In many cases r 2 CV and r 2 adj are taken as a proof of the high predictive ability of MLR models. A high value of these statistical characteristics (>0.5) is considered as a proof of the high predictive ability of the model. However, some recent reports have proved the opposite (20). Although, the low value of r 2 CV for the training set can indeed serve as an indicator of a low predictive ability of a model, the opposite is not necessarily true. Thus, the high value of LOO r 2 CV is the necessary condition for a model to have a high predictive power, but it is not a sufficient one.
Although models showed good internal consistency, they may not be applicable for the analogs which were never used in the generation of the correlation. It is proven that the only way to estimate the true predictive power of a model is to test it on a sufficiently large collection of the samples from an external test set. The test set must include no less than five samples, whose properties and structures must cover the range of properties and structures of the samples from the training set. This application is necessary for obtaining trustful statistics for comparison between the observed and predicted values for these compounds. Therefore, the external extrapolation power of the model was further authenticated by a test set of ten heroin samples.
The values of TEB/DAM peak ratio of an external set of heroin samples (samples 29-38) were calculated by the models. These data are compared with experimentally obtained values of TEB/DAM ratio (Table 4. Figure 1.). From the data presented in Table 4. it is shown that high agreement between experimental and predicted TEB/DAM ratio was obtained (the residual values are small, indicating the good predictability of the established models). According to the reference (16) without the validation of the MLR models by using the external test set, we could not come to a right conclusion about high predictive ability of derived models.
To investigate the existence of a systemic error in developing the MLR models, the  residuals of predicted TEB/DAM peak ratio values were plotted against the experimental values in Figure 2. The propagation of the residuals on both sides of zero indicates that no systematic error exists in the development of regression models as suggested by . It indicates that these models can be successfully applied to predict the geographic origin of seized heroin samples using the GC results. Therefore, the randomness of the residuals and their low values indicate that the obtained mathematical models can predict the dependent variable with acceptable error. According to the Variance Inflation Factor (VIF), which was lower than 10 for all the obtained models, it can be concluded that there is no multi collinearity present in the established models. HCA was performed on the TEB/DAM, MAM/DAM, PAP/DAM and NOS/DAM peak ratios of the analysed heroin samples in order to reveal the similarities among them. Clustering was based on the Euclidean distance and single linkage algorithm. The obtained dendrogram is presented in Figure 3. As it can be seen from the presented dendrogram, on the basis of TEB/ DAM, MAM/DAM, PAP/DAM and NOS/DAM peak ratios, the most similar heroin samples come from border crossing (2) and Novi Sad municipality (3).These entities are placed into the separate cluster, while the samples that belong to border crossing (1) are significantly different than the others. WWR test was applied firstly on the TEB/ DAM, MAM/DAM, PAP/DAM and NOS/ DAM peak ratios together. The established null hypothesis "two samples come from populations having the same distribution" was confirmed for the samples that are seized at border crossing (2) and Novi Sad municipality (Table 5). This result confirms the finding obtained with HCA method. According to WWR test, the samples from Novi Sad municipality and border crossing (1) do not belong to the same population, as well as the samples from border crossings (1) and (2).
Testing the H 0 hypothesis for the examined samples on the basis of TEB/DAM, MAM/ DAM, PAP/DAM and NOS/DAM peak ratios separately, resulted in the same way as in previous WWR analysis, except in the case which included TEB/DAM peak ratio (Table 6). In this case, all the three types of samples do not belong to the same population. It can indicate that exactly TEB/DAM peak ratio can be used as discriminating factor for the analysed samples. As it is shown in the MLR analysis, this ratio actually is dependent variable which is predicted based on the other determined peak ratios.

Conclusion
Collected data were modeled by MLR, HCA and WWR methods. Mathematical dependences that can determine the geographical origin of heroin samples were obtained. The validity of the models has been evaluated by the determination of suitable statistical parameters. Predictive ability of defined mathematical model was tested by comparing and correlating the experimental and calculated values of TEB/DAM peak ratio. The low residual activity and high cross-validated r 2 values (r 2 CV ) indicated the predictive ability of the developed MLR models. Since the correlation was extremely good, our mathematical models can be used to predict geographic origin of seized heroin samples in Serbia, using the GC results. HCA analysis showed that the samples from Novi Sad municipality and border crossing (2) are very similar, while WWR testing explained that mentioned samples belong to the same population according to MAM/DAM, PAP/DAM and NOS/DAM peak ratios. TEB/ DAM peak ratio (the dependent variable in MLR model) was discriminating factor in WWR testing and it showed that the analysed samples do not belong to the same population.